Colorectal cancer prognosis based on dietary pattern using synthetic minority oversampling technique with K-nearest neighbors approach

A person's life span is generally influenced by their food consumption, because diet can contribute to deadly diseases such as colorectal cancer (CRC). In 2020, colorectal cancer accounted for one million fatalities globally, representing 10% of all cancer casualties. This analysis included 76,679 males and 78,213 females over the age of 59 from ten states in the United States. During follow-up, 1378 men and 981 women were diagnosed with colon cancer. This prospective cohort study used 231 food items and their variants as input features to identify CRC patients. Before labelling any food as CRC-causing, it is ethical to analyse facts such as how many grams of the food are consumed daily and how many times a week it is eaten. This research examines five classification algorithms on real-time datasets: K-Nearest Neighbour (KNN), Decision Tree (DT), Random Forest (RF), Logistic Regression with Classifier Chain (LRCC), and Logistic Regression with Label Powerset (LRLC). The SMOTE algorithm is then applied to address the imbalances in the data. Our study shows that eating more than 10 g/d of low-fat butter in bread (RR 1.99, CI 0.91–4.39), and eating it more than twice a week (RR 1.49, CI 0.93–2.38), increases CRC risk. Concerning beef, eating in excess of 74 g of beef steak daily (RR 0.88, CI 0.50–1.55) and having it more than once a week (RR 0.88, CI 0.62–1.23) decreases the risk of CRC. People who include beef and dairy products in their daily diet should therefore be cautious about quantity; consuming those items in moderation on a regular basis will protect against CRC risk. Meanwhile, a high intake of poultry (RR 0.2, CI 0.05–0.81), fish (RR 0.82, CI 0.31–2.16), and pork (RR 0.67, CI 0.17–2.65) is negatively correlated with CRC hazards.

Random forest (RF). Random Forest combines the outputs of multiple decision trees, and this ensemble gives it a high chance of improving accuracy [37][38][39]. RF uses ensemble learning to handle multiple outputs and to improve performance on complex problems. RF also takes relatively little time to train. One key feature of RF is that it can prevent model overfitting 40, which is the major issue in our healthcare dataset 41. RF performs multilabel classification.
Logistic regression. Logistic regression can give good results on classification and regression problems [42][43][44].
Here we did not use raw logistic regression. We adapted the algorithm to our use case based on runtime and applied it in two different ways. The first is Logistic Regression with Classifier Chain (LRCC), and the second is Logistic Regression with Label Powerset (LRLC). Classifier Chain models can arrange each chain in random order [45][46][47]. The selected chain is an optimal ordering of the classes, giving the best performance while training the model. The core idea of Label Powerset is to transform each combination of labels found in the data into one unique label. This approach has high computational complexity, reaching 2^|C| label combinations in the worst case, where |C| is the number of labels. Both methods gave better and more optimized results than raw logistic regression.
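The Label Powerset transformation described above can be illustrated with a minimal sketch. This is not the study's implementation: the function names `label_powerset_encode` and `label_powerset_decode` and the toy label matrix are assumptions introduced only for illustration. Each distinct combination of labels seen in the data is mapped to one class id, so a multilabel problem becomes a single multiclass problem that any base classifier, such as logistic regression, can then be trained on.

```python
from typing import Dict, List, Sequence, Tuple


def label_powerset_encode(
    Y: Sequence[Sequence[int]],
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Map every distinct label combination in Y to a single class id."""
    combo_to_class: Dict[Tuple[int, ...], int] = {}
    classes: List[int] = []
    for row in Y:
        key = tuple(row)
        if key not in combo_to_class:
            # Ids are assigned in order of first appearance.
            combo_to_class[key] = len(combo_to_class)
        classes.append(combo_to_class[key])
    return classes, combo_to_class


def label_powerset_decode(
    classes: Sequence[int], combo_to_class: Dict[Tuple[int, ...], int]
) -> List[List[int]]:
    """Turn predicted class ids back into the original label vectors."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return [list(class_to_combo[c]) for c in classes]


# Toy example with two binary output labels (hypothetical values).
Y = [[0, 0], [1, 0], [0, 0], [1, 1]]
classes, mapping = label_powerset_encode(Y)
decoded = label_powerset_decode(classes, mapping)
```

With two binary labels, at most 2^2 = 4 combined classes can occur, which matches the worst-case bound of 2^|C| noted above.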

Literature study
Colorectal cancer (CRC) remains a significant health concern even in today's technologically advanced world. Several epidemiological studies have outlined the role of dietary factors in colorectal cancer incidence. Numerous studies on daily dietary patterns reveal that red and processed meat consumption is strongly connected to CRC risk. In 2020, Bradbury et al. 48 investigated the correlation between red and processed meat consumption and CRC risk using the UK Biobank dataset, confirming that eating red and processed meat increases CRC risk. Notably, ingesting more than 54 g of red meat and 25 g of processed meat doubled the risk of CRC. Likewise, a study conducted by Feng et al. 49 in 2021 using the UK Biobank dataset, one by Mehta et al. 50 in the same year using data from the USA and Puerto Rico, and one by Bernstein et al. 51 in 2015 based on a USA dataset further strengthened these results. All these studies collectively underscore the heightened risk of CRC associated with increased red and processed meat consumption.

Questionnaire
Participants were asked three types of questions, and the data were collected. The Baseline Questionnaire (BQ) asked all participants for basic information such as age, occupation, smoking, drinking, and medical history, and 96.8% of participants completed it. All participants were encouraged to complete the Diet History Questionnaire (DHQ), which collects information on their daily food intake. The DHQ provides the daily grams and daily frequencies of the various foods and beverages consumed by participants, and 77% of participants completed it. Only intervention arm participants were given the Dietary Questionnaire (DQX). Like the DHQ, it contains basic details such as daily foods and frequency of food consumption. These raw data are then systematically processed and converted into variables suitable for analysis, such as gram intake and daily food frequency.

Dietary exposure assessment
All participants were questioned about alcohol servings and serving sizes, and the obtained values were converted into grams using DietCalc software. Regarding food values, DietCalc was used to compute DHQ nutritional values based on food frequency, serving size, and other questionnaire data. Food frequency values, in contrast, reflect how frequently a person consumes a food without regard to its portion size. Sometimes more than one food response contributed when calculating the gram amounts and frequencies of certain foods. In such cases, a specific food's gram or frequency value is calculated by adding together the grams or frequencies of all the contributing food items.

Cancer ascertainment
During the PLCO trial, participants' colorectal cancer was confirmed through Medical Record Abstraction (MRA), self-reports, family reports, and death certificates. Any confirmed invasive tumor, in situ cancer, or borderline malignancy identified during an annual cancer screening is considered a cancer endpoint. In certain situations, when clear MRA records were not accessible, the existing or available medical records were analyzed and summarized. Follow-up activity also continued when documentary records were not public. Most importantly, if the MRA approach finds no confirmation of a colorectal cancer diagnosis, the case is not considered or documented as confirmed colorectal cancer, even if self-reports, family reports, or death certificates indicating colorectal cancer exist.

Statistical analysis
The primary objective of the proposed prospective cohort study is to develop a precise ML model to determine the association between colorectal cancer risk and participants' daily dietary habits. It also investigates how daily consumed food accelerates colorectal cancer. The colorectal cancer prediction model was designed using the SMOTE technique and the KNN algorithm. Furthermore, risk ratio (RR) and 95% confidence interval (CI) values were used to determine cancer hazards. The participants' food consumption parameters are grouped separately into alcohol, beef, butter, cheese, milk, yoghurt, chicken, fish, and pork consumption values as per the research needs. For the research objectives, these food consumption values were converted into grams. Following that, the study compared participants' daily food consumption values (g) and the weekly frequency of intake of a specific food with CRC incidence, and classified the findings by food category.
The research identifies which types of food intake increase colorectal cancer threats by analyzing statistical data on daily food consumption and frequency values together with the research outcomes. The research outcomes are grouped into two different food types: vegetarian and non-vegetarian. The findings classify potential causes of cancer into five groups:
a. Positive relations. These foods exhibit a significant positive association with CRC risk. The output values for daily intake of the specific food had a positive relationship, and the output values for food frequency also had a positive relationship.
b. Negative relations. These foods are negatively related to CRC. The output values for daily intake of the specific food had a negative link, and the output values for food frequency also had a negative link.
c. No positive relation. These foods have not been positively linked to CRC hazards. Consider participants who consume a particular food. The output values for daily food intake have a positive relationship, whereas the output values for frequency of intake have a negative relationship. Similarly, the output values for daily food intake may show a negative correlation while the output values for frequency of intake show a positive association. In such uncommon circumstances, our research considered these dietary values to have no positive relation with CRC.
d. No negative relation. These foods have also not been positively linked to CRC threat. The output values for daily food intake have a negative relationship, and the frequency of food intake has insufficient values. In such a situation, our research considered that these food items do not cause CRC threats and classified them as having no negative relation.
e. Moderate relation. Some foods are neither healthy nor hazardous for our health. That is, there is no substantial positive or negative link between that specific food and cancer.
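The five groups above can be expressed as a small decision rule. The sketch below is only an illustration under stated assumptions: the text does not give numeric cut-offs, so treating RR > 1 as a "positive" relationship, RR < 1 as a "negative" relationship, and a missing value (`None`) as "insufficient values" is an assumption, as are the function names `direction` and `relation_category`.

```python
from typing import Optional


def direction(rr: Optional[float]) -> Optional[str]:
    """Assumed mapping from a risk ratio to a direction of association."""
    if rr is None:
        return None  # insufficient values to compute an RR
    if rr > 1:
        return "positive"
    if rr < 1:
        return "negative"
    return "neutral"


def relation_category(rr_daily: Optional[float], rr_weekly: Optional[float]) -> str:
    """Classify a food from its daily-intake RR and weekly-frequency RR."""
    d, w = direction(rr_daily), direction(rr_weekly)
    if d == "positive" and w == "positive":
        return "positive relation"
    if d == "negative" and w == "negative":
        return "negative relation"
    if {d, w} == {"positive", "negative"}:
        # Daily and weekly results point in opposite directions.
        return "no positive relation"
    if d == "negative" and w is None:
        return "no negative relation"
    return "moderate relation"


# RR pairs reported in the abstract for low-fat butter in bread and beef steak.
butter = relation_category(1.99, 1.49)
beef_steak = relation_category(0.88, 0.88)
```

The rule deliberately ignores the confidence intervals; a stricter variant could require the CI to exclude 1 before calling a relationship positive or negative.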

Multilabel accuracy prediction
Calculating the accuracy of multilabel classification is challenging. Determining the appropriate level of accuracy for each of these models adds complexity to the analysis. A generic function is therefore needed to compute accuracy for all methods. For a record to count as correct, the set of predicted labels in y_pred must match the corresponding set in y_true. Finally, the mean value over the y_pred and y_true arrays is computed and returned. This process is handled in multiple phases, repeated according to the label size.
Adopting a healthy diet and lifestyle may reduce the chance of developing colorectal cancer. A detailed analysis of food, its quantity, and its frequency of intake is essential to lessen the CRC hazards. This research applies the SMOTE approach to all ML models to eliminate the data imbalances in the PLCO dataset and improve the model's prediction. The experimental results showed that KNN-SMOTE achieved higher accuracy than the rest of the ML models. The primary aims of the study following the implementation of KNN-SMOTE on the dataset are as follows:
To investigate the correlations between red/white meat consumption and CRC incidence.
To scrutinize how a certain food in a specific amount and at a certain frequency affects CRC likelihood.
To research how a particular food combination would increase or decrease the CRC risk.
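To make the accuracy calculation described at the start of this section concrete, the following is a minimal sketch of an exact-match (subset) accuracy function. It assumes that a record counts as correct only when every predicted label equals the true label, which is one reading of the matching rule stated above; the function name `multilabel_accuracy` and the toy arrays are hypothetical and are not taken from the study's code.

```python
from typing import Sequence


def multilabel_accuracy(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> float:
    """Fraction of records whose full predicted label set equals the true one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of records")
    if len(y_true) == 0:
        raise ValueError("at least one record is required")
    # 1.0 for an exact match of all labels in the record, otherwise 0.0.
    matches = [1.0 if list(t) == list(p) else 0.0 for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


# Toy data with two output labels per record (hypothetical values).
y_true = [[1, 0], [0, 0], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 0], [0, 0], [0, 0]]
score = multilabel_accuracy(y_true, y_pred)
```

Because the same function accepts any model's predictions, it can be reused unchanged for KNN, DT, RF, and both logistic regression variants.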

Risk ratio (RR) and confidence interval (CI)
The risk associated with a particular food intake in a specific quantity (grams) and at a particular frequency is evaluated using the Risk Ratio (RR) and the 95% Confidence Interval (CI).
The Confidence Interval (CI) measures the level of certainty or uncertainty around the relationship between CRC and food intake (g). A 95% CI is a statistical metric that gives the range expected to contain the true value with 95% confidence.
The 95% CI is calculated from the RR and is therefore completely dependent on it. If RR < 1, the food is associated with a reduced risk of CRC. If RR = 1, the food shows no association with CRC risk. If RR > 1, the food is associated with an increased risk of CRC.
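As an illustration of how an RR and its 95% CI can be obtained from counts, the sketch below uses the standard log-based (Katz) interval, RR × exp(±1.96 × SE), where SE is the standard error of ln(RR). This is an assumption: the study's own formula is not reproduced in this section, and the function name `risk_ratio_ci` and the example counts are hypothetical.

```python
import math
from typing import Tuple


def risk_ratio_ci(
    a: int, b: int, c: int, d: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Risk ratio with a log-based confidence interval from a 2x2 table.

    a: exposed participants with CRC      b: exposed participants without CRC
    c: unexposed participants with CRC    d: unexposed participants without CRC
    z: normal quantile (1.96 for a 95% interval)
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for cumulative-incidence data.
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts, chosen only to show the calculation.
rr, lower, upper = risk_ratio_ci(a=20, b=80, c=10, d=90)
```

In this made-up example the RR is 2.0 but the interval spans 1, so the apparent doubling of risk would not be statistically significant at the 95% level.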

Preprocessing
The dataset contained a lot of noise. Some features were of no use for training ML models. Therefore, such columns were removed to decrease the model training time and increase effectiveness. Since we already had an "age level" column, we decided to remove the "age" column. PLCO_id is only a patient's unique identifier and cannot be used in model training in any way. Subsequently, duplicates and erroneous values in the dietary information were meticulously identified and eliminated, along with columns having blank, null, or negative entries. A total of 27 fields were removed from the original dataset, leaving 204 features available for further processing. ML models cannot handle multiple data types within a single column. All rows with data of different types in the same column were evaluated carefully, and identical rows were created where necessary.

Usage of synthetic minority oversampling technique (SMOTE)
The real-time datasets help to quickly train our model with different ML classifiers. The model consistently achieves a validation accuracy of 95%. The reason behind this seemingly excellent accuracy is that one class, positive or negative, has very few records. Consider, for instance, a coronavirus death dataset in which deaths make up only 4%-5% of the records. When this data is fed into an ML model, it learns from 95% false and 5% true information. Once fitting has completed, the model's accuracy is computed. It is important to note that the model has learned from only 5% true data, so it consistently reports misleading accuracy rates. In this study, this issue is referred to as "overfitting" on a real-time dataset. Our dataset has a similarly uneven distribution of values. Therefore, we need to ensure that the dataset is well balanced.
The SMOTE method is used to solve this problem. The primary goal of SMOTE is to rebalance the original dataset using a KNN-based procedure. An imbalanced dataset can be transformed into a balanced one using two techniques: oversampling and undersampling. Undersampling simply discards a large amount of data from the dataset and could eliminate much of the valuable data indiscriminately. So, this method is not valid for our use case. Oversampling is the approach taken by SMOTE. It creates synthetic records that resemble our original dataset in order to achieve balance.
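The oversampling idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the implementation used in the study: the function name `smote_oversample`, the default `k`, and the toy minority points are assumptions. Each synthetic record is placed at a random position on the line segment between a minority-class record and one of its k nearest minority-class neighbours.

```python
import math
import random
from typing import List, Sequence


def smote_oversample(
    minority: Sequence[Sequence[float]], n_synthetic: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create n_synthetic records by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority records are required")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority records by Euclidean distance to the base record.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: four records with two features each (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_records = smote_oversample(minority, n_synthetic=6, k=2)
```

Synthetic records should be generated from the training split only, so that no interpolated copies of test records leak into model evaluation.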

Result and discussion
The primary goal of the research is to compare which foods are known to cause CRC and which food consumption raises CRC risks. Finally, the proposed model compares the intake values of various food types with CRC incidence. This investigation recommends that people who read the research findings be cautious when choosing foods. The study's findings are not intended to frighten people.

Performance appraisal
The proposed model is trained using five ML algorithms, namely KNN, RF, DT, LRCC and LRLP. The performance of the classifiers is evaluated based on the accuracy of the employed ML models before and after applying SMOTE.

Accuracy comparison
Table 1 shows the accuracy comparison for all the methods before and after applying SMOTE. From Table 1, it is evident that KNN performed better in classification accuracy after applying SMOTE. Thus, the study used KNN as the ML model for training the dataset to identify CRC risk factors.

Performance metrics evaluation for KNN
The proposed KNN algorithm's effectiveness is measured using precision, recall, specificity, and F1-score. Table 1 shows that our ML model achieved 98.43% accuracy before employing SMOTE. Very few data points had the value '1'. Hence, the model was trained on skewed data and learned a misleading pattern. During the accuracy calculation, the classifier labelled each record as '0'. This is the reason for the extremely high accuracy before using SMOTE. After applying SMOTE, the study obtained an 85% success rate. At the same time, the model performed even better at distinguishing cancer patients from healthy participants. Consequently, the proposed SMOTE-KNN algorithm would correctly predict cancerous and non-cancerous individuals with respect to food quantity and frequency. A visual representation of the efficiency metrics used to evaluate the proposed model is shown in Fig. 1.
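The effect described above, where accuracy is high even though no cancer case is detected, can be reproduced with a short sketch of the four metrics. The confusion-matrix counts below are hypothetical and chosen only to mimic a dataset with very few '1' records; they are not the study's results, and the function name `binary_metrics` is an assumption.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, specificity and F1 from confusion counts."""

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical case: 1000 records, 15 true cancer cases, every record predicted '0'.
all_zero = binary_metrics(tp=0, fp=0, tn=985, fn=15)
```

Here accuracy is 0.985 while recall and F1 are both 0, which is why accuracy alone is not a sufficient measure on an imbalanced dataset.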

Result analysis
The primary goal of the research is to compare which foods are known to cause CRC and which food consumption raises CRC risks. Additionally, the study compares intake levels of various food types with CRC incidence. This investigation recommends that people who read the research findings be cautious when choosing foods. The study's findings are not intended to frighten people. The research outcomes are described in two sections.
Section 1: Based on the relationship between food consumption value, frequency, and CRC incidence, the type of food that stimulates CRC risk is identified.
Section 2: The study addresses which kinds of food combinations (mixed foods) increase the CRC risk. We estimated the risk ratio and 95% confidence interval in all sections to classify the risk factors.

Daily food consumption and food frequency responses
This section compares participants' daily food consumption values with how frequently they consume those foods each week and discusses the risk values of the specific foods. Serving size is used to represent the food intake values of participants. The serving size values vary by food, based on total food consumption. Table 2 uses the following values for serving size: one serving of liquor was 100 g; one serving of beef, fish, pork, and poultry was 74 g; one serving of butter, cheese, and ice cream was 10 g; and one serving of milk and yogurt was 125 g. Furthermore, Table 2 shows the risk ratio based on daily food consumption, whereas Table 3 displays the risk ratio for weekly food consumption.

Table 3. Association between food consumption frequency and CRC risk. '-', not applicable (no data/studies available to calculate RR, CI). 1 time: 0 to 0.15 g; 2 times: 0.16 to 0.3 g; 3 times: 0.31 to 0.45 g; 4 times: 0.46 to 0.6 g; 5 times: 0.61 to 0.75 g; 6 times: 0.76 to 0.9 g; > 6 times: ≥ 0.91 g.

0.07-1.13), and eating mixing with other foods (RR 0.9, CI 0.71-1.14) all reduces the severity of CRC cancer.

Positive relation
In terms of weekly intake, consuming 2% fat milk with regular foods (excluding coffee, tea, or cereal) at least thrice a week (RR 0.91, CI 0.67-1.23) and consuming soy milk with everyday foods (excluding coffee, tea, or cereals) more than twice a week (RR 0.29, CI 0.07-1.15) were negatively correlated with CRC risk. Ingesting skimmed milk with grains or cereal more than once a week (RR 0.89, CI 0.71-1.10), drinking skimmed milk with coffee or tea more than twice a week (RR 0.76, CI 0.43-1.34), and having skimmed milk with regular foods more than once a week (RR 0.83, CI 0.64-1.07) all decrease the likelihood of CRC formation. Surprisingly, eating more than 125 g of fresh yogurt per day (RR 0.82, CI 0.66-1.01) and having it at least twice a week (RR 0.7, CI 0.55-0.89) significantly decreases CRC risk. Similarly, consuming more than 125 g of frozen yogurt daily (RR 0.51, CI 0.24-1.07) and ingesting it twice a week (RR 0.71, CI 0.47-1.07) were both negatively associated with CRC formation.

Chicken
There are two varieties of chicken meat: dark chicken meat and white chicken meat. Dark chicken meat comes from chicken thighs and drumsticks (chicken legs), whereas white chicken meat comes from chicken breasts and wings. Daily consumption of more than 74 g of dark chicken meat (RR 0.66, CI 0.09-4.66), white chicken meat (RR 0. … 0.36-1.59), and white meat chicken without skin (RR 0.73, CI 0.59-0.92) all dramatically reduce CRC hazards. Figure 5 depicts some food items that are not negatively associated with CRC.

Dairy products
Butter, cheese and milk. Consuming more than 10 g of regular butter with potatoes every day (RR 0.91, CI 0.48-1.75) and eating it almost twice per week (RR 0.82, CI 0.67-0.99) both lower the risk of CRC. Eating reduced-fat butter with any dishes at least twice a week (RR 0.72, CI 0.37-1.37) and having regular butter with pancakes and/or waffles twice a week (RR 0.8, CI 0.48-1.35) significantly reduced CRC risk. Likewise, a lessened CRC risk was confirmed for those who consumed more than 10 g of regular cheese every day (RR 0.93, CI 0.78-1.10) and those who ate it more than twice a week (RR 0.91, CI 0.79-1.05). Additionally, soy milk intake in coffee and tea almost twice a week (RR 0.62, CI 0.09-4.36) decreases CRC hazard.

Butter-ice cream and ice milk
Our results show that there was no definitive link between CRC and the consumption of low-fat ice cream, regular ice cream, and ice milk. Figure 5 depicts some food items that are moderately related to CRC.

Conclusions and future needs
The proposed research is based on food, aiming to find out what kind of food intake stimulates the cells that cause deadly diseases like CRC. The study compares several machine learning models on a real-time dataset and identifies which algorithm performs best in accuracy and other performance metrics. Since the research used real-time information, the dataset had imbalanced input and output values. SMOTE was used to handle this imbalance. SMOTE artificially augments the minority class data (input) to be equivalent to the majority class data (output). Then, the accuracies of all ML models were calculated before and after applying SMOTE to the dataset. Finally, after solving the imbalance problems using SMOTE, all ML models predicted the CRC patients more accurately. At the end of the study, the SMOTE-KNN classification algorithm performed well on a real-time dataset and identified CRC patients with high accuracy.
The proposed study examines the common foods individuals consume daily and various subtypes of those foods. Red meat intake, specifically beef stew, roasted beef, and sandwiches with roasted beef, revealed a positive connection with CRC risk, in line with previous epidemiological research [48][49][50][51]. Furthermore, our findings demonstrate that eating particular forms of beef, such as beef steaks, roast beef with sandwiches, beef burgers, ground beef or meatloaf, and lean beef steak, in excess of 74 g daily and no more than once a week significantly lowers CRC risk. In accordance with these findings, it is clear that not all red meat consumption increases CRC hazards. Eating hot dogs, lasagna, ravioli, and shell foods has recently become widespread in many places. Our research also examines those foods under the mixed foods segment and finds that eating hot dogs, lasagna, ravioli, and shell foods dramatically reduces CRC risk. However, consuming beef meatballs and beef or pork liver elevated the CRC risk. Overall, our research findings reveal a complex link between red meat consumption and CRC risk, emphasizing the significance of including particular types and amounts of red meat in dietary recommendations.
The majority of the epidemiological research suggests that poultry intake mitigates CRC risk. However, in practical terms, most of us have the chance to taste multiple subtypes of chicken, such as chicken wings, chicken legs, boneless chicken, chicken with skin, and skinless chicken. Therefore, in addition to focusing on normal chicken meat, it is essential to investigate different varieties to better understand their impact on CRC risk. Our study results indicate that daily consumption of over 74 g of dark chicken, white chicken, fried chicken, ground chicken, and turkey is negatively associated with CRC incidence. Eating in excess of 74 g daily of non-fat and non-fried fish, tuna, canned tuna, and pork was found to reduce the risk, which aligns with previous results 52,53. Moreover, being careful about the amount while eating certain types of chicken can protect us from CRC. The findings reveal that consuming more than 74 g of non-fried skinless chicken dark meat and fried white meat chicken with skin on a daily basis promotes CRC risk. Rather than suggesting that all poultry ingestion enhances the CRC threat 54, our findings clearly highlight which chicken varieties, and in what amounts, may raise the risk.
Our findings reveal some crucial links between dairy products and CRC risk. Nowadays, many chemical-added products are available in the market and are consumed by most individuals. To understand the complex relationship between dairy products and CRC risk, an investigation should focus on chemical-added products. The proposed research comprehensively examined dairy products and their different subtypes. Notably, our findings confirm that consuming natural chemical-free butter, low-fat and fat-free cheese, reduced-fat and skimmed milk, and fresh and frozen yoghurt reduces CRC risk, consistent with the results of previous studies 55,56. On the flip side, the examination of value-added byproducts exposes a striking correlation: CRC risk increased with eating low-fat milk and non-fortified milk mixed with cereal, eating cream cheese, and ingesting reduced-fat butter with bread and vegetables. According to our findings, it is evident that natural dairy products like

Figure 5. Foods not negatively related (NNR) and moderately related (MR) to colorectal cancer. Note: NNR and MR a1 denote daily consumption. NNR and MR b1 denote weekly consumption.
Dataset
The NCI and CDAS provide us with real-time datasets. To the best of our knowledge, the NCI dataset used in this study is the most up-to-date version available. The dataset has 231 input characteristics, two output features, and about 155,000 participant records. It has 2359 cancer records. Cancer patients constitute only 2% of the whole dataset.

Table 2 .
Beef
Not all kinds of beef increase CRC risk. Conversely, consuming specific types of beef is linked to a decreased risk of CRC. Consumption of beef steaks in excess of 74 g daily (RR 0.88, CI 0.50-1.55) and having it more than once a week (RR 0.88, CI 0.62-1.23) significantly decreased the CRC hazards. Figure 3 depicts a few foods that reduce the CRC risk.